Clustering ASVs

Code
library(dplyr) ; library(tidyr) ; library(ggplot2) 
library(doParallel) ; library(foreach) ; library(doSNOW)

root <- rprojroot::has_file(".git/index")
datadir = root$find_file("data")
funsdir = root$find_file("functions")
savingdir = root$find_file("saved_files")

datapath = root$find_file(paste0(datadir,'/','grump.phaeocystis_asv_long.csv'))
files_vec <- list.files(funsdir)
currentwd <- getwd()

for( i in 1:length(files_vec)){
  source(root$find_file(paste0(funsdir,'/',files_vec[i])))
}

funcitons_parallel = c("clusterize_ward_pam",
                       "eval_adist_clustering",
                       "inducedDist",
                       "summarise_adist_clustering",
                       "tidy_grump")

dframe = data.table::fread(input = datapath)
dframe = tidy_grump(Dframe = dframe)

The idea is to create a ‘custom’ distance matrix, to induce the grouppings.

  • \(c_1\) is within the same species - but no unassigned
  • \(c_2\) is between assigned species
  • \(c_3\) is between species and unassigned
  • \(c_4\) is within unassigned and unassigned

For this document we will use only the following configuration:

\(c_1=1\), \(c_2=1000\), \(c_3=10\) , \(c_4=10\)

A priori, the ’probability` of unassigned ASVs to agglomerate between themselves is the same as if it was to agglomerate with other species. Large \(c_2\) makes it harder for known species ASVs to group between themselves.

Code
inducedDist_ = inducedDist(
  dFrame = dframe$dframe,
  c1 = 1,c2=1000,c3=10,c4=10,
  compMatrix = dframe$ASV_composition)

#inducedDist$myDist_checking

\[ S = \frac{numerator}{denominator} \]

Where:

\[ numerator = \frac{1}{k} \left( \sum_{j=1}^{k} \left[ \frac{1}{n_j}\sum_{i}^{n_j} d(x_i-\bar x^{(j)} ) \right] \right) \]

The mean average distance within clusters (each observation and the medoid of the cluster)

\[ denominator = \frac{1}{\binom{k}{2}} \sum_{i \neq j } d( \bar x^{(i)}-\bar x^{(j)} ) \]

The average distance between clusters (each cluster is represented by it’s medoid)

\[ S = \frac{numerator}{denominator} \]

Where:

\[ SS_T = \frac{1}{N} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N-1} d^2_{ij} \]

Being \(d^2_{ij}\) is the squared distance between objects \(i\) and \(j\).

\[ SS_W = \frac{1}{n} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N-1} d^2_{ij}\delta^2_{ij} \]

\(\delta^2_{ij}\) is an indicator function (1 if obs\(i\) and \(j\) are from the same cluster, 0 otherwise). \(SS_A\) Then difference between the overall and the within groups sum-of-squares (\(SS_{A}=SS_{T}-SS_{W}\)).

\[ F = \frac{\frac{SS_{A}}{p-1}}{\frac{SS_{W}}{N-p}} \]

The idea of the new metric is: what if we represent the heterogeneity of a cluster by its maximum distance. Therefore, the average maximum distance within tends to decrease as the number of cluster increases.

In counterpart, let’s think the min distance between clusters as how different two clusters are. Thus, the average minimum distance between clusters also tends to decrease as the number of clusters increase.

Lets call this metric \(T\).

\[T = \frac{\frac{1}{k}\sum_{(i,j)\in k}\max(d(x_i,x_j))} {\frac{1}{k}\sum_{(i,j)\notin k}\min(d(x_i,x_j))}\]

Code
running=F

if(running){
  ## pure biotic factors = 
  list_with_clusters_alpha0 <- clusterize_ward_pam(
    distMatrix = inducedDist_$normalizedAitDist,
    maxnclust = 50,
    dFrameAsv = dframe$ASV_composition$name)
  
  alpha_=0.1
  
  distMatrix_ = alpha_*inducedDist_$normalizedMyDist+(1-alpha_)*inducedDist_$normalizedAitDist
  
  list_with_clusters_alpha0.10 <- clusterize_ward_pam(
    distMatrix = distMatrix_,
    maxnclust = 50,
    dFrameAsv = dframe$ASV_composition$name)

  alpha_=0.25
  
  distMatrix_ = alpha_*inducedDist_$normalizedMyDist+(1-alpha_)*inducedDist_$normalizedAitDist
  
  list_with_clusters_alpha0.25 <- clusterize_ward_pam(
    distMatrix = distMatrix_,
    maxnclust = 50,
    dFrameAsv = dframe$ASV_composition$name)
  
  alpha_=0.5
  
  distMatrix_ = alpha_*inducedDist_$normalizedMyDist+(1-alpha_)*inducedDist_$normalizedAitDist
  
  list_with_clusters_alpha0.50 <- clusterize_ward_pam(
    distMatrix = distMatrix_,
    maxnclust = 50,
    dFrameAsv = dframe$ASV_composition$name)
  
  saveRDS(object = list_with_clusters_alpha0,file = paste0(savingdir,'/','list_with_clusters_alpha0'))
  saveRDS(object = list_with_clusters_alpha0.10,file = paste0(savingdir,'/','list_with_clusters_alpha0.10'))
  saveRDS(object = list_with_clusters_alpha0.25,file = paste0(savingdir,'/','list_with_clusters_alpha0.25'))
  saveRDS(object = list_with_clusters_alpha0.50,file = paste0(savingdir,'/','list_with_clusters_alpha0.50'))
}

list_with_clusters_alpha0 = readRDS(file = paste0(savingdir,'/','list_with_clusters_alpha0'))
list_with_clusters_alpha0.10 = readRDS(file = paste0(savingdir,'/','list_with_clusters_alpha0.10'))
list_with_clusters_alpha0.25 = readRDS(file = paste0(savingdir,'/','list_with_clusters_alpha0.25'))
list_with_clusters_alpha0.50 = readRDS(file = paste0(savingdir,'/','list_with_clusters_alpha0.50'))
Code
running = F
###################################### ------------------------------------ hclust
if(running){
  nclusters <- ncol(list_with_clusters_alpha0$dFrameAsv_hclust)-1
  list_results <- list()
  
  for( i in 1:nclusters){
    list_results[[i]] <-  eval_adist_clustering(compositionDF = left_join(
      list_with_clusters_alpha0$dFrameAsv_hclust[,c(1,i+1)] %>% rename('name'=1,'Cluster'=2),
      dframe$ASV_composition))
    cat('iteration------:',i,'\n')
  }
  saveRDS(list_results,file = paste0(savingdir,'/','list_results_eval_alpha0_hclust'))
}

if(running){
  nclusters <- ncol(list_with_clusters_alpha0.10$dFrameAsv_hclust)-1
  list_results <- list()
  
  for( i in 1:nclusters){
    list_results[[i]] <-  eval_adist_clustering(compositionDF = left_join(
      list_with_clusters_alpha0.10$dFrameAsv_hclust[,c(1,i+1)] %>% rename('name'=1,'Cluster'=2),
      dframe$ASV_composition))
    cat('iteration------:',i,'\n')
  }
  saveRDS(list_results,file = paste0(savingdir,'/','list_results_eval_alpha0.10_hclust'))
}

if(running){
  nclusters <- ncol(list_with_clusters_alpha0.25$dFrameAsv_hclust)-1
  list_results <- list()
  
  for( i in 1:nclusters){
    list_results[[i]] <-  eval_adist_clustering(compositionDF = left_join(
      list_with_clusters_alpha0.25$dFrameAsv_hclust[,c(1,i+1)] %>% rename('name'=1,'Cluster'=2),
      dframe$ASV_composition))
    cat('iteration------:',i,'\n')
  }
  saveRDS(list_results,file = paste0(savingdir,'/','list_results_eval_alpha0.25_hclust'))
}

if(running){
  nclusters <- ncol(list_with_clusters_alpha0.50$dFrameAsv_hclust)-1
  list_results <- list()
  
  for( i in 1:nclusters){
    list_results[[i]] <-  eval_adist_clustering(compositionDF = left_join(
      list_with_clusters_alpha0.50$dFrameAsv_hclust[,c(1,i+1)] %>% rename('name'=1,'Cluster'=2),
      dframe$ASV_composition))
    cat('iteration------:',i,'\n')
  }
  saveRDS(list_results,file = paste0(savingdir,'/','list_results_eval_alpha0.50_hclust'))
}

###################################### ------------------------------------ pam
running =F

if(running){
  nclusters <- ncol(list_with_clusters_alpha0$dFrameAsv_pam)-1
  list_results <- list()
  
  for( i in 1:nclusters){
    list_results[[i]] <-  eval_adist_clustering(compositionDF = left_join(
      list_with_clusters_alpha0$dFrameAsv_pam[,c(1,i+1)] %>% rename('name'=1,'Cluster'=2),
      dframe$ASV_composition))
    cat('iteration------:',i,'\n')
  }
  saveRDS(list_results,file = paste0(savingdir,'/','list_results_eval_alpha0_pam'))
}

if(running){
  nclusters <- ncol(list_with_clusters_alpha0.10$dFrameAsv_pam)-1
  list_results <- list()
  
  for( i in 1:nclusters){
    list_results[[i]] <-  eval_adist_clustering(compositionDF = left_join(
      list_with_clusters_alpha0.10$dFrameAsv_pam[,c(1,i+1)] %>% rename('name'=1,'Cluster'=2),
      dframe$ASV_composition))
    cat('iteration------:',i,'\n')
  }
  saveRDS(list_results,file = paste0(savingdir,'/','list_results_eval_alpha0.10_pam'))
}

if(running){
  nclusters <- ncol(list_with_clusters_alpha0.25$dFrameAsv_pam)-1
  list_results <- list()
  
  for( i in 1:nclusters){
    list_results[[i]] <-  eval_adist_clustering(compositionDF = left_join(
      list_with_clusters_alpha0.25$dFrameAsv_pam[,c(1,i+1)] %>% rename('name'=1,'Cluster'=2),
      dframe$ASV_composition))
    cat('iteration------:',i,'\n')
  }
  saveRDS(list_results,file = paste0(savingdir,'/','list_results_eval_alpha0.25_pam'))
}

if(running){
  nclusters <- ncol(list_with_clusters_alpha0.50$dFrameAsv_pam)-1
  list_results <- list()
  
  for( i in 1:nclusters){
    list_results[[i]] <-  eval_adist_clustering(compositionDF = left_join(
      list_with_clusters_alpha0.50$dFrameAsv_pam[,c(1,i+1)] %>% rename('name'=1,'Cluster'=2),
      dframe$ASV_composition))
    cat('iteration------:',i,'\n')
  }
  saveRDS(list_results,file = paste0(savingdir,'/','list_results_eval_alpha0.50_pam'))
}

list_results_eval_alpha0_hclust = readRDS(paste0(savingdir,'/','list_results_eval_alpha0_hclust'))
list_results_eval_alpha0.10_hclust = readRDS(paste0(savingdir,'/','list_results_eval_alpha0.10_hclust'))
list_results_eval_alpha0.25_hclust = readRDS(paste0(savingdir,'/','list_results_eval_alpha0.25_hclust'))
list_results_eval_alpha0.50_hclust = readRDS(paste0(savingdir,'/','list_results_eval_alpha0.50_hclust'))

list_results_eval_alpha0_pam = readRDS(paste0(savingdir,'/','list_results_eval_alpha0_pam'))
list_results_eval_alpha0.10_pam = readRDS(paste0(savingdir,'/','list_results_eval_alpha0.10_pam'))
list_results_eval_alpha0.25_pam= readRDS(paste0(savingdir,'/','list_results_eval_alpha0.25_pam'))
list_results_eval_alpha0.50_pam = readRDS(paste0(savingdir,'/','list_results_eval_alpha0.50_pam'))
Code
eval_clusters_full = bind_rows(
  summarise_adist_clustering(list_results_eval_alpha0_hclust) %>% mutate(method='hclust',alpha=0),
  summarise_adist_clustering(list_results_eval_alpha0.10_hclust) %>% mutate(method='hclust',alpha=0.10),
  summarise_adist_clustering(list_results_eval_alpha0.25_hclust) %>% mutate(method='hclust',alpha=0.25),
  summarise_adist_clustering(list_results_eval_alpha0.50_hclust) %>% mutate(method='hclust',alpha=0.50),
  summarise_adist_clustering(list_results_eval_alpha0_pam) %>% mutate(method='pam',alpha=0),
  summarise_adist_clustering(list_results_eval_alpha0.10_pam) %>% mutate(method='pam',alpha=0.10),
  summarise_adist_clustering(list_results_eval_alpha0.25_pam) %>% mutate(method='pam',alpha=0.25),
  summarise_adist_clustering(list_results_eval_alpha0.50_pam) %>% mutate(method='pam',alpha=0.50))
Code
eval_clusters_full %>% select(nclust,alpha,method,ratio_avg_within_between) %>% 
  filter(nclust>3) %>% mutate(alpha=factor(alpha,ordered = T,labels = c(0,0.10,0.25,0.50))) %>% 
  pivot_longer(cols = -c(1,2,3),names_to = 'metric') %>%
  ggplot(aes(x = nclust, y = value,col=method,linetype=alpha)) +
    geom_line() +
    theme_minimal(base_size = 12) +
    facet_wrap(~metric, scales = "free_y")+
  theme(legend.position = 'bottom')+
  geom_vline(xintercept = 12, linetype="dotted", 
                color = "blue")+
  geom_vline(xintercept = 22, linetype="dotted", 
                color = "red")

Code
eval_clusters_full %>% select(
  nclust,alpha,method,
  mean_avg_within,
  avg_between_medoids
  ,ratio_avg_within_between) %>% 
  filter(nclust>3) %>% mutate(alpha=factor(alpha,ordered = T,labels = c(0,0.10,0.25,0.50))) %>% 
  pivot_longer(cols = -c(1,2,3),names_to = 'metric') %>%
  ggplot(aes(x = nclust, y = value,col=method,linetype=alpha)) +
  geom_line() +
  theme_minimal(base_size = 12) +
  facet_wrap(~metric, scales = "free_y")+
  theme(legend.position = 'bottom')+
  geom_vline(xintercept = 12, linetype="dotted", 
             color = "blue")+
  geom_vline(xintercept = 22, linetype="dotted", 
             color = "red")

Code
eval_clusters_full %>% select(nclust,alpha,method,avg_max_dist_within,
                              avg_min_dist_clust_i_other,ratio_avg_max_min) %>% 
  filter(nclust>3) %>% mutate(alpha=factor(alpha,ordered = T,labels = c(0,0.10,0.25,0.50))) %>% 
  pivot_longer(cols = -c(1,2,3),names_to = 'metric') %>%
  ggplot(aes(x = nclust, y = value,col=method,linetype=alpha)) +
    geom_line() +
    theme_minimal(base_size = 12) +
    facet_wrap(~metric, scales = "free_y")

Code
cv_clustering <- readRDS(paste0(savingdir,'/','cv_clustering'))
cv_clustering_df = data.table::rbindlist(cv_clustering)
#cv_clustering_df %>% select(method,nclust,alpha0) %>% distinct_all()

group_summary = c('ratio_avg_within_between','ratio_avg_max_min')

group_permanova = c('Fstat','SSA','SSW')
group_S = c('ratio_avg_within_between','mean_avg_within','avg_between_medoids')

group_max = c('max_max_dist_within','avg_max_dist_within','max_min_dist_clust_i_other',
              'avg_min_dist_clust_i_other')

cv_clustering_df_longer = cv_clustering_df %>%
  mutate(alpha=factor(alpha0,ordered = T,
                      labels = c('0.00','0.10','0.25','0.50'))) %>% 
  reshape2::melt(id.vars = c("nclust","method","iteration",'alpha') ) %>%
  filter(nclust > 3) %>% 
  select(-iteration) %>% 
  group_by(method,nclust,variable,alpha) %>% 
  summarise(avg=mean(value),stdev=sd(value)) %>% 
  ungroup()
  
cv_clustering_df_longer %>%
  filter(variable %in% group_summary[1]) %>% 
  ggplot(aes(x = nclust, y = avg,group=method,color=method,fill=method),alpha=0.1) +
  geom_line(aes(linetype=alpha)) +
  geom_ribbon(aes(ymin=avg-stdev, ymax=avg+stdev),alpha=0.25)+
  theme_minimal(base_size = 16) +
  facet_grid(method~alpha, scales = "free_y")+
  theme(legend.position = 'bottom')+
  scale_color_discrete()+
  geom_vline(xintercept = 12, linetype="dotted", 
             color = "blue")+
  geom_vline(xintercept = 22, linetype="dotted", 
             color = "red")

I would pick 22 as the best number of clusters.

Code
cv_clustering_df_longer %>%
  filter(variable %in% group_summary[2]) %>% 
  filter(nclust>5) %>% 
  ggplot(aes(x = nclust, y = avg,group=method,
             color=method,fill=method),alpha=0.1) +
  geom_line(aes(linetype=alpha)) +
  geom_ribbon(aes(ymin=avg-stdev, ymax=avg+stdev),alpha=0.25)+
  theme_minimal(base_size = 16) +
  facet_grid(method~alpha, scales = "free_y")+
  theme(legend.position = 'bottom')+
  scale_color_discrete()+
  geom_vline(xintercept = 15, linetype="dotted", 
             color = "blue")+
  geom_vline(xintercept = 22, linetype="dotted", 
             color = "red")

Here 15 feels like the best choice for the number of clusters.

With mosaic

Code

Here you will only find the scripts to generate the results

Code
############# -- arranging data

############# Finding K -------------------------------------------------------------------
running = F
if (running){
  set.seed(1234)
  alpha0_bootstrap = finding_optimal_k_parallel(
    dframe_asv = dframe$dframe %>% select(SampleKey,ASV_name,Raw.Sequence.Counts),
    B = 500,split_pct = 0.5,maxnclust_ = 50,vec_functions_fromEnv = funcitons_parallel)
  saveRDS(alpha0_bootstrap,file = paste0(savingdir,'/','alpha0_bootstrap'))
}

running = F
if (running){
  set.seed(1234)
  alpha0_bootstrap = finding_optimal_k_parallel(
    dframe_asv = dframe$dframe %>% select(SampleKey,ASV_name,Raw.Sequence.Counts),
    B = 500,split_pct = 0.5,maxnclust_ = 50,vec_functions_fromEnv = funcitons_parallel,
    inducedDistMatrix = inducedDist_$normalizedMyDist,alpha_ = 0.10)
  saveRDS(alpha0_bootstrap,file = paste0(savingdir,'/','alpha0.10_bootstrap'))
}


running = F
if (running){
  set.seed(1234)
  alpha0_bootstrap = finding_optimal_k_parallel(
    dframe_asv = dframe$dframe %>% select(SampleKey,ASV_name,Raw.Sequence.Counts),
    B = 500,split_pct = 0.5,maxnclust_ = 50,vec_functions_fromEnv = funcitons_parallel,
    inducedDistMatrix = inducedDist_$normalizedMyDist,alpha_ = 0.25)
  saveRDS(alpha0_bootstrap,file = paste0(savingdir,'/','alpha0.25_bootstrap'))
}

running = F
if (running){
  set.seed(1234)
  alpha0_bootstrap = finding_optimal_k_parallel(
    dframe_asv = dframe$dframe %>% select(SampleKey,ASV_name,Raw.Sequence.Counts),
    B = 500,split_pct = 0.5,maxnclust_ = 50,vec_functions_fromEnv = funcitons_parallel,
    inducedDistMatrix = inducedDist_$normalizedMyDist,alpha_ = 0.50)
  saveRDS(alpha0_bootstrap,file = paste0(savingdir,'/','alpha0.50_bootstrap'))
}
############# Finding K -------------------------------------------------------------------
Code
############# Permanova -------------------------------------------------------------------
running = F
if(running){
perm_alpha0 = permanova_custom(
  list_with_clusters = list_with_clusters_alpha0,
  ait_distMatrix = inducedDist_$normalizedAitDist,
  B = 500)
saveRDS(perm_alpha0,file = paste0(savingdir,'/','perm_alpha0'))
}

running = F
if(running){
perm_alpha0.10 = permanova_custom(
  list_with_clusters = list_with_clusters_alpha0.10,
  ait_distMatrix = inducedDist_$normalizedAitDist,
  B = 500)
saveRDS(perm_alpha0.10,file = paste0(savingdir,'/','perm_alpha0.10'))
}

running = F
if(running){
perm_alpha0.25 = permanova_custom(
  list_with_clusters = list_with_clusters_alpha0.25,
  ait_distMatrix = inducedDist_$normalizedAitDist,
  B = 500)
saveRDS(perm_alpha0.25,file = paste0(savingdir,'/','perm_alpha0.25'))
}

running = F
if(running){
perm_alpha0.50 = permanova_custom(
  list_with_clusters = list_with_clusters_alpha0.50,
  ait_distMatrix = inducedDist_$normalizedAitDist,
  B = 500)
saveRDS(perm_alpha0.50,file = paste0(savingdir,'/','perm_alpha0.50'))
}
Code
## arranging the results

############# Finding K -------------------------------------------------------------------
alpha0_cv = readRDS(paste0(savingdir,'/','alpha0_bootstrap'))
alpha0_cv_df = data.table::rbindlist(alpha0_cv) %>% data.frame()

Results

‘CV-like’ measurement was calculated at 500 iterations of:

  1. After the split

    • create the composition of ASVs in 80% of the samples
    • fit the clustering methods (hclust and pam) from k = 1 : 50
    • evaluate each cluster in the ‘test’ data
Code
# alpha0_cv_df %>% 
#   reshape2::melt(id.vars = c("nclust","method","iteration") ) %>%
#   filter(nclust > 1) %>% 
#   select(-iteration) %>% select(variable) %>% distinct() %>% pull()

group_summary = c('ratio_avg_within_between','ratio_avg_max_min')

group_permanova = c('Fstat','SSA','SSW')
group_S = c('ratio_avg_within_between','mean_avg_within','avg_between_medoids')

group_max = c('max_max_dist_within','avg_max_dist_within','max_min_dist_clust_i_other',
              'avg_min_dist_clust_i_other')
Code
alpha0_cv_df %>% 
  reshape2::melt(id.vars = c("nclust","method","iteration") ) %>%
  filter(nclust > 1) %>% 
  select(-iteration) %>% 
  filter(variable %in% group_summary) %>% 
  group_by(method,nclust,variable) %>% 
  summarise(avg=mean(value),stdev=sd(value)) %>% 
  ungroup() %>% 
  ggplot(aes(x = nclust, y = avg, group = method,
             color=method,fill=method)) +
  geom_line() +
  geom_ribbon(aes(ymin=avg-stdev, ymax=avg+stdev),alpha=0.5)+
  theme_minimal(base_size = 16) +
  facet_wrap(~variable, scales = "free_y")+
  theme(legend.position = 'bottom')

Code
alpha0_cv_df %>% 
  reshape2::melt(id.vars = c("nclust","method","iteration") ) %>%
  filter(nclust > 1) %>% 
  select(-iteration) %>% 
  filter(variable %in% group_permanova) %>% 
  group_by(method,nclust,variable) %>% 
  summarise(avg=mean(value),stdev=sd(value)) %>% 
  ungroup() %>% 
  ggplot(aes(x = nclust, y = avg, group = method,
             color=method,fill=method)) +
  geom_line() +
  geom_ribbon(aes(ymin=avg-stdev, ymax=avg+stdev),alpha=0.5)+
  theme_minimal(base_size = 16) +
  facet_wrap(~variable, scales = "free_y")+
  theme(legend.position = 'bottom')

Code
alpha0_cv_df %>% 
  reshape2::melt(id.vars = c("nclust","method","iteration") ) %>%
  filter(nclust > 1) %>% 
  select(-iteration) %>% 
  filter(variable %in% group_S) %>% 
  group_by(method,nclust,variable) %>% 
  summarise(avg=mean(value),stdev=sd(value)) %>% 
  ungroup() %>% 
  ggplot(aes(x = nclust, y = avg, group = method,
             color=method,fill=method)) +
  geom_line() +
  geom_ribbon(aes(ymin=avg-stdev, ymax=avg+stdev),alpha=0.5)+
  theme_minimal(base_size = 16) +
  facet_wrap(~variable, scales = "free_y")+
  theme(legend.position = 'bottom')

Code
alpha0_cv_df %>% 
  reshape2::melt(id.vars = c("nclust","method","iteration") ) %>%
  filter(nclust > 1) %>% 
  select(-iteration) %>% 
  filter(variable %in% group_max) %>% 
  group_by(method,nclust,variable) %>% 
  summarise(avg=mean(value),stdev=sd(value)) %>% 
  ungroup() %>% 
  ggplot(aes(x = nclust, y = avg, group = method,
             color=method,fill=method)) +
  geom_line() +
  geom_ribbon(aes(ymin=avg-stdev, ymax=avg+stdev),alpha=0.5)+
  theme_minimal(base_size = 16) +
  facet_wrap(~variable, scales = "free_y")+
  theme(legend.position = 'bottom')

Code
alpha0_cv_df %>% 
  reshape2::melt(id.vars = c("nclust","method","iteration") ) %>%
  filter(nclust > 1) %>% 
  select(-iteration) %>% 
  #mutate(nclust=factor(nclust,levels=1:number_clusters)) %>% 
  group_by(method,nclust,variable) %>% 
  #summarise(avg=mean(value),stdev=sd(value)) %>% 
  #ungroup() %>% 
  #group_by(method,nclust,variable) %>% 
  summarise(Freq_Problem = sum(is.infinite(value))) %>% 
  filter(Freq_Problem>0) %>% 
  knitr::kable() %>% kableExtra::kable_styling()
method nclust variable Freq_Problem
hclustD 2 ratio_avg_within_between 74
hclustD 2 ratio_avg_max_min 75
hclustD 3 ratio_avg_max_min 74
pam 2 ratio_avg_within_between 82
pam 2 ratio_avg_max_min 82
pam 3 ratio_avg_within_between 27
pam 3 ratio_avg_max_min 27
pam 4 ratio_avg_within_between 2
pam 4 ratio_avg_max_min 2
Code
alpha0_cv_df %>% 
  reshape2::melt(id.vars = c("nclust","method","iteration") ) %>%
  filter(nclust > 1) %>% 
  select(-iteration) %>% 
  #mutate(nclust=factor(nclust,levels=1:number_clusters)) %>% 
  group_by(method,nclust,variable) %>% 
  #summarise(avg=mean(value),stdev=sd(value)) %>% 
  #ungroup() %>% 
  #group_by(method,nclust,variable) %>% 
  summarise(Freq_Problem = sum(is.infinite(value))) %>% 
  filter(Freq_Problem>0) %>% 
  knitr::kable() %>% kableExtra::kable_styling()
method nclust variable Freq_Problem
hclustD 2 ratio_avg_within_between 74
hclustD 2 ratio_avg_max_min 75
hclustD 3 ratio_avg_max_min 74
pam 2 ratio_avg_within_between 82
pam 2 ratio_avg_max_min 82
pam 3 ratio_avg_within_between 27
pam 3 ratio_avg_max_min 27
pam 4 ratio_avg_within_between 2
pam 4 ratio_avg_max_min 2
Code
alpha0_cv_df %>% 
  reshape2::melt(id.vars = c("nclust","method","iteration") ) %>%
  filter(nclust > 1) %>% 
  select(-iteration) %>% 
  #mutate(nclust=factor(nclust,levels=1:number_clusters)) %>% 
  group_by(method,nclust,variable) %>% 
  #summarise(avg=mean(value),stdev=sd(value)) %>% 
  #ungroup() %>% 
  #group_by(method,nclust,variable) %>% 
  summarise(Freq_Problem = sum(value==0)) %>% 
  filter(Freq_Problem>0) %>% 
  knitr::kable() %>% kableExtra::kable_styling()
method nclust variable Freq_Problem
hclustD 2 avg_between_medoids 74
hclustD 2 max_min_dist_clust_i_other 75
hclustD 2 avg_min_dist_clust_i_other 75
hclustD 3 max_min_dist_clust_i_other 74
hclustD 3 avg_min_dist_clust_i_other 74
pam 2 avg_between_medoids 82
pam 2 max_min_dist_clust_i_other 82
pam 2 avg_min_dist_clust_i_other 82
pam 3 avg_between_medoids 27
pam 3 max_min_dist_clust_i_other 27
pam 3 avg_min_dist_clust_i_other 27
pam 4 avg_between_medoids 2
pam 4 max_min_dist_clust_i_other 2
pam 4 avg_min_dist_clust_i_other 2
Code
alpha0_cv_df %>% 
  reshape2::melt(id.vars = c("nclust","method","iteration") ) %>%
  filter(nclust > 1) %>% 
  select(-iteration) %>% 
  #mutate(nclust=factor(nclust,levels=1:number_clusters)) %>% 
  group_by(method,nclust,variable) %>% 
  summarise(avg=mean(value),stdev=sd(value)) %>% 
  ungroup() %>% 
  ggplot(aes(x = nclust, y = avg, group = method,color=method,fill=method)) +
    geom_line() +
    geom_ribbon(aes(ymin=avg-stdev, ymax=avg+stdev),alpha=0.5)+
    theme_minimal(base_size = 12) +
    facet_wrap(~variable, scales = "free_y")+
  theme(legend.position = 'bottom')

Code
############# Permanova -------------------------------------------------------------------
perm_alpha <- readRDS(paste0(savingdir,'/','perm_alpha0'))

df_permanova <- plyr::ldply(perm_alpha$hclust, function(el){
  return(data.frame(nclust = el$nclust[1], pvalue = el$`Pr(>F)`[1], r2 = el$R2[1]))
}, .id = "nclust")

df_permanova2 <- plyr::ldply(perm_alpha$pam, function(el){
  return(data.frame(nclust = el$nclust[1], pvalue = el$`Pr(>F)`[1], r2 = el$R2[1]))
}, .id = "nclust")


df_perm_alpha0 = bind_rows(
  df_permanova %>% mutate(method='hclust'),
  df_permanova2 %>% mutate(method='pam'),
) %>% mutate(alpha=0)

########### 0.1 

perm_alpha <- readRDS(paste0(savingdir,'/','perm_alpha0.10'))

df_permanova <- plyr::ldply(perm_alpha$hclust, function(el){
  return(data.frame(nclust = el$nclust[1], pvalue = el$`Pr(>F)`[1], r2 = el$R2[1]))
}, .id = "nclust")

df_permanova2 <- plyr::ldply(perm_alpha$pam, function(el){
  return(data.frame(nclust = el$nclust[1], pvalue = el$`Pr(>F)`[1], r2 = el$R2[1]))
}, .id = "nclust")


df_perm_alpha0.10 = bind_rows(
  df_permanova %>% mutate(method='hclust'),
  df_permanova2 %>% mutate(method='pam'),
) %>% mutate(alpha=0.10)

########### 0.25
perm_alpha <- readRDS(paste0(savingdir,'/','perm_alpha0.25'))

df_permanova <- plyr::ldply(perm_alpha$hclust, function(el){
  return(data.frame(nclust = el$nclust[1], pvalue = el$`Pr(>F)`[1], r2 = el$R2[1]))
}, .id = "nclust")

df_permanova2 <- plyr::ldply(perm_alpha$pam, function(el){
  return(data.frame(nclust = el$nclust[1], pvalue = el$`Pr(>F)`[1], r2 = el$R2[1]))
}, .id = "nclust")

df_perm_alpha0.25 = bind_rows(
  df_permanova %>% mutate(method='hclust'),
  df_permanova2 %>% mutate(method='pam'),
) %>% mutate(alpha=0.25)

########### 0.50
perm_alpha <- readRDS(paste0(savingdir,'/','perm_alpha0.50'))

df_permanova <- plyr::ldply(perm_alpha$hclust, function(el){
  return(data.frame(nclust = el$nclust[1], pvalue = el$`Pr(>F)`[1], r2 = el$R2[1]))
}, .id = "nclust")

df_permanova2 <- plyr::ldply(perm_alpha$pam, function(el){
  return(data.frame(nclust = el$nclust[1], pvalue = el$`Pr(>F)`[1], r2 = el$R2[1]))
}, .id = "nclust")

df_perm_alpha0.50 = bind_rows(
  df_permanova %>% mutate(method='hclust'),
  df_permanova2 %>% mutate(method='pam'),
) %>% mutate(alpha=0.50)


df_perm  = bind_rows(df_perm_alpha0,df_perm_alpha0.10,df_perm_alpha0.25,df_perm_alpha0.50)
Code
df_perm %>% filter(nclust>1) %>% 
  mutate(alpha_=factor(alpha)) %>% 
  ggplot(aes(
  x=nclust,y=r2,color=method,linetype=alpha_
  ))+
  geom_line()+
  theme_minimal()+
  theme(legend.position = 'bottom')

Code
stable_obj_alpha0= readRDS(paste0(savingdir,'/','stable_obj_alpha0'))
stable_obj_alpha0.10= readRDS(paste0(savingdir,'/','stable_obj_alpha0.10'))
stable_obj_alpha0.25= readRDS(paste0(savingdir,'/','stable_obj_alpha0.25'))
stable_obj_alpha0.50= readRDS(paste0(savingdir,'/','stable_obj_alpha0.50'))

stable_obj_alpha0_n22 = readRDS(paste0(savingdir,'/','stable_obj_alpha0_n22'))
stable_obj_alpha0.10_n22= readRDS(paste0(savingdir,'/','stable_obj_alpha0.10_n22'))
stable_obj_alpha0.25_n22= readRDS(paste0(savingdir,'/','stable_obj_alpha0.25_n22'))
stable_obj_alpha0.50_n22= readRDS(paste0(savingdir,'/','stable_obj_alpha0.50_n22'))



pmatrix_a0 <- reorder_to_matrix_plot(stable_obj_alpha0_n22)
pmatrix_a0.10 <- reorder_to_matrix_plot(stable_obj_alpha0.10_n22)
pmatrix_a0.25 <- reorder_to_matrix_plot(stable_obj_alpha0.25_n22)
pmatrix_a0.50 <- reorder_to_matrix_plot(stable_obj_alpha0.50_n22)
Code
pallete10 <- colorRampPalette(c('white','purple4'))(4)
#pallete10 = c("#FFFFFF",pallete10)

hclus_0 <- stable_obj_alpha0_n22$hclust_
hclus_0.10 <- stable_obj_alpha0.10_n22$hclust_
hclus_0.25 <- stable_obj_alpha0.25_n22$hclust_
hclus_0.50 <- stable_obj_alpha0.50_n22$hclust_


dm = ggdendroplot::hmReady(df = hclus_0) %>% mutate(simm=cut(value,ordered_result = T,
                 breaks = seq(0,1,0.25),include.lowest = T,right = T))

dm2 = ggdendroplot::hmReady(df = hclus_0.10) %>% mutate(simm=cut(value,ordered_result = T,
                 breaks = seq(0,1,0.25),include.lowest = T,right = T))

dm3 = ggdendroplot::hmReady(df = hclus_0.25) %>% mutate(simm=cut(value,ordered_result = T,
                 breaks = seq(0,1,0.25),include.lowest = T,right = T))

dm4 = ggdendroplot::hmReady(df = hclus_0.50) %>% mutate(simm=cut(value,ordered_result = T,
                 breaks = seq(0,1,0.25),include.lowest = T,right = T))

p1 = dm %>% ggplot(aes(x=rowid, y=variable, fill=simm)) + 
  geom_tile() + 
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank())+
  scale_fill_manual(values = pallete10)+
  ggtitle('alpha=0')
  

p2 = dm2 %>% ggplot(aes(x=rowid, y=variable, fill=simm)) + 
 geom_tile() + 
 theme_minimal() +
 theme(
   axis.text.x = element_blank(),
   axis.text.y = element_blank(),
   axis.ticks = element_blank())+
 scale_fill_manual(values = pallete10)+
  ggtitle('alpha=0.1')

p3 = dm3 %>% ggplot(aes(x=rowid, y=variable, fill=simm)) + 
 geom_tile() + 
 theme_minimal() +
 theme(
   axis.text.x = element_blank(),
   axis.text.y = element_blank(),
   axis.ticks = element_blank())+
 scale_fill_manual(values = pallete10)+
  ggtitle('alpha=0.25')

p4 = dm4 %>% ggplot(aes(x=rowid, y=variable, fill=simm)) + 
 geom_tile() + 
 theme_minimal() +
 theme(
   axis.text.x = element_blank(),
   axis.text.y = element_blank(),
   axis.ticks = element_blank())+
 scale_fill_manual(values = pallete10)+
  ggtitle('alpha=0.50')

p1
p2
p3
p4

Now reordering

Code
pmatrix_a0$p_hclust
pmatrix_a0.10$p_hclust
pmatrix_a0.25$p_hclust
pmatrix_a0.50$p_hclust

Code
p1 = dm %>%
  group_by(simm) %>% 
  summarise(Pct = n()/nrow(dm)) %>%
  ggplot(aes(x=simm,y=Pct))+
  geom_col()+
  theme_minimal()+
  ggtitle('alpha=0')
  
p2 = dm2 %>%
  group_by(simm) %>% 
  summarise(Pct = n()/nrow(dm)) %>%
  ggplot(aes(x=simm,y=Pct))+
  geom_col()+
  theme_minimal()+
  ggtitle('alpha=0.1')

p3 = dm3 %>%
  group_by(simm) %>% 
  summarise(Pct = n()/nrow(dm)) %>%
  ggplot(aes(x=simm,y=Pct))+
  geom_col()+
  theme_minimal()+
  ggtitle('alpha=0.25')

p4 = dm4 %>%
  group_by(simm) %>% 
  summarise(Pct = n()/nrow(dm)) %>%
  ggplot(aes(x=simm,y=Pct))+
  geom_col()+
  theme_minimal()+
  ggtitle('alpha=0.50')

gridExtra::grid.arrange(p1,p2,p3,p4,ncol=4)

Let’s reorder to get a better visualization.

Code
df_asv_species = dframe$ASV_composition %>% transmute(ASV_name=name) %>% 
   left_join(dframe$dframe %>% select(ASV_name,Species) %>% distinct()) %>% 
   filter(!is.na(Species)) %>% 
   arrange(Species)

df_asv_species %>% group_by(Species) %>% 
  summarise(Freq=n()) %>% mutate(Cumm_Freq = cumsum(Freq)) %>%
  knitr::kable() %>% kableExtra::kable_styling()
Species Freq Cumm_Freq
Phaeocystis_antarctica 9 9
Phaeocystis_cordata 4 13
Phaeocystis_globosa 34 47
Phaeocystis_pouchetii 3 50
Phaeocystis_rex 4 54
Code
p1 <- plot_knowns_matrix(dm_aux = dm,df_ordered_ASVs = df_asv_species) + ggtitle('alpha=0')
p2 <- plot_knowns_matrix(dm_aux = dm2,df_ordered_ASVs = df_asv_species) +ggtitle('alpha=0.10')
p3 <- plot_knowns_matrix(dm_aux = dm3,df_ordered_ASVs = df_asv_species) +ggtitle('alpha=0.25')
p4 <- plot_knowns_matrix(dm_aux = dm4,df_ordered_ASVs = df_asv_species) +ggtitle('alpha=0.50')

# p_all <- ggpubr::ggarrange(p1,p2,p3,p4,common.legend = T,ncol=2,nrow=2)
# p_all
p1
p2
p3
p4

Code
p1 <- plot_knowns_vs_unknowns(dm_aux = dm,df_ordered_ASVs = df_asv_species) + ggtitle('alpha=0')
p2 <- plot_knowns_vs_unknowns(dm_aux = dm2,df_ordered_ASVs = df_asv_species) +ggtitle('alpha=0.10')
p3 <- plot_knowns_vs_unknowns(dm_aux = dm3,df_ordered_ASVs = df_asv_species) +ggtitle('alpha=0.25')
p4 <- plot_knowns_vs_unknowns(dm_aux = dm4,df_ordered_ASVs = df_asv_species) +ggtitle('alpha=0.50')

p1
p2
p3
p4

Code
vet_species = c("Phaeocystis_antarctica","Phaeocystis_cordata","Phaeocystis_globosa","Phaeocystis_pouchetii","Phaeocystis_rex")

plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0',vet_species[1]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm2,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.1',vet_species[1]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.25',vet_species[1]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.50',vet_species[1]))

Code
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0',vet_species[2]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm2,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.1',vet_species[2]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.25',vet_species[2]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.50',vet_species[2]))

Code
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0',vet_species[3]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm2,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.1',vet_species[3]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.25',vet_species[3]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.50',vet_species[3]))

Code
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0',vet_species[4]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm2,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.1',vet_species[4]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.25',vet_species[4]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.50',vet_species[4]))

Code
vet_species = c("Phaeocystis_antarctica","Phaeocystis_cordata","Phaeocystis_globosa","Phaeocystis_pouchetii","Phaeocystis_rex")

plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0',vet_species[5]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm2,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.1',vet_species[5]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.25',vet_species[5]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.50',vet_species[5]))

Code
vet_species = c("Phaeocystis_antarctica","Phaeocystis_cordata","Phaeocystis_globosa","Phaeocystis_pouchetii","Phaeocystis_rex")

plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0',vet_species[5]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm2,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.1',vet_species[5]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.25',vet_species[5]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.50',vet_species[5]))

Code
pallete10 <- colorRampPalette(c('white','purple4'))(4)
#pallete10 = c("#FFFFFF",pallete10)

hclus_0 <- stable_obj_alpha0_n22$pam_
hclus_0.10 <- stable_obj_alpha0.10_n22$pam_
hclus_0.25 <- stable_obj_alpha0.25_n22$pam_
hclus_0.50 <- stable_obj_alpha0.50_n22$pam_


dm = ggdendroplot::hmReady(df = hclus_0) %>% mutate(simm=cut(value,ordered_result = T,
                 breaks = seq(0,1,0.25),include.lowest = T,right = T))

dm2 = ggdendroplot::hmReady(df = hclus_0.10) %>% mutate(simm=cut(value,ordered_result = T,
                 breaks = seq(0,1,0.25),include.lowest = T,right = T))

dm3 = ggdendroplot::hmReady(df = hclus_0.25) %>% mutate(simm=cut(value,ordered_result = T,
                 breaks = seq(0,1,0.25),include.lowest = T,right = T))

dm4 = ggdendroplot::hmReady(df = hclus_0.50) %>% mutate(simm=cut(value,ordered_result = T,
                 breaks = seq(0,1,0.25),include.lowest = T,right = T))

p1 = dm %>% ggplot(aes(x=rowid, y=variable, fill=simm)) + 
  geom_tile() + 
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank())+
  scale_fill_manual(values = pallete10)+
  ggtitle('alpha=0')
  

p2 = dm2 %>% ggplot(aes(x=rowid, y=variable, fill=simm)) + 
 geom_tile() + 
 theme_minimal() +
 theme(
   axis.text.x = element_blank(),
   axis.text.y = element_blank(),
   axis.ticks = element_blank())+
 scale_fill_manual(values = pallete10)+
  ggtitle('alpha=0.1')

p3 = dm3 %>% ggplot(aes(x=rowid, y=variable, fill=simm)) + 
 geom_tile() + 
 theme_minimal() +
 theme(
   axis.text.x = element_blank(),
   axis.text.y = element_blank(),
   axis.ticks = element_blank())+
 scale_fill_manual(values = pallete10)+
  ggtitle('alpha=0.25')

p4 = dm4 %>% ggplot(aes(x=rowid, y=variable, fill=simm)) + 
 geom_tile() + 
 theme_minimal() +
 theme(
   axis.text.x = element_blank(),
   axis.text.y = element_blank(),
   axis.ticks = element_blank())+
 scale_fill_manual(values = pallete10)+
  ggtitle('alpha=0.50')

p1
p2
p3
p4

Now reordering

Code
pallete10 <- colorRampPalette(c('white','purple4'))(4)
#pallete10 = c("#FFFFFF",pallete10)

pmatrix_a0$p_pam
pmatrix_a0.10$p_pam
pmatrix_a0.25$p_pam
pmatrix_a0.50$p_pam

Code
p1 = dm %>%
  group_by(simm) %>% 
  summarise(Pct = n()/nrow(dm)) %>%
  ggplot(aes(x=simm,y=Pct))+
  geom_col()+
  theme_minimal()+
  ggtitle('alpha=0')
  
p2 = dm2 %>%
  group_by(simm) %>% 
  summarise(Pct = n()/nrow(dm)) %>%
  ggplot(aes(x=simm,y=Pct))+
  geom_col()+
  theme_minimal()+
  ggtitle('alpha=0.1')

p3 = dm3 %>%
  group_by(simm) %>% 
  summarise(Pct = n()/nrow(dm)) %>%
  ggplot(aes(x=simm,y=Pct))+
  geom_col()+
  theme_minimal()+
  ggtitle('alpha=0.25')

p4 = dm4 %>%
  group_by(simm) %>% 
  summarise(Pct = n()/nrow(dm)) %>%
  ggplot(aes(x=simm,y=Pct))+
  geom_col()+
  theme_minimal()+
  ggtitle('alpha=0.50')


gridExtra::grid.arrange(p1,p2,p3,p4,ncol=4)

Let’s reorder to get a better visualization.

Code
df_asv_species = dframe$ASV_composition %>% transmute(ASV_name=name) %>% 
   left_join(dframe$dframe %>% select(ASV_name,Species) %>% distinct()) %>% 
   filter(!is.na(Species)) %>% 
   arrange(Species)

df_asv_species %>% group_by(Species) %>% 
  summarise(Freq=n()) %>% mutate(Cumm_Freq = cumsum(Freq)) %>%
  knitr::kable() %>% kableExtra::kable_styling()
Species Freq Cumm_Freq
Phaeocystis_antarctica 9 9
Phaeocystis_cordata 4 13
Phaeocystis_globosa 34 47
Phaeocystis_pouchetii 3 50
Phaeocystis_rex 4 54
Code
p1 <- plot_knowns_matrix(dm_aux = dm,df_ordered_ASVs = df_asv_species) + ggtitle('alpha=0')
p2 <- plot_knowns_matrix(dm_aux = dm2,df_ordered_ASVs = df_asv_species) +ggtitle('alpha=0.10')
p3 <- plot_knowns_matrix(dm_aux = dm3,df_ordered_ASVs = df_asv_species) +ggtitle('alpha=0.25')
p4 <- plot_knowns_matrix(dm_aux = dm4,df_ordered_ASVs = df_asv_species) +ggtitle('alpha=0.50')

p1
p2
p3
p4

Code
p1 <- plot_knowns_vs_unknowns(dm_aux = dm,df_ordered_ASVs = df_asv_species) + ggtitle('alpha=0')
p2 <- plot_knowns_vs_unknowns(dm_aux = dm2,df_ordered_ASVs = df_asv_species) +ggtitle('alpha=0.10')
p3 <- plot_knowns_vs_unknowns(dm_aux = dm3,df_ordered_ASVs = df_asv_species) +ggtitle('alpha=0.25')
p4 <- plot_knowns_vs_unknowns(dm_aux = dm4,df_ordered_ASVs = df_asv_species) +ggtitle('alpha=0.50')

p1
p2
p3
p4

Code
vet_species = c("Phaeocystis_antarctica","Phaeocystis_cordata","Phaeocystis_globosa","Phaeocystis_pouchetii","Phaeocystis_rex")

plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0',vet_species[1]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm2,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.1',vet_species[1]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.25',vet_species[1]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.50',vet_species[1]))

Code
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0',vet_species[2]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm2,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.1',vet_species[2]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.25',vet_species[2]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.50',vet_species[2]))

Code
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0',vet_species[3]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm2,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.1',vet_species[3]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.25',vet_species[3]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.50',vet_species[3]))

Code
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0',vet_species[4]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm2,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.1',vet_species[4]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.25',vet_species[4]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.50',vet_species[4]))

Code
vet_species = c("Phaeocystis_antarctica","Phaeocystis_cordata","Phaeocystis_globosa","Phaeocystis_pouchetii","Phaeocystis_rex")

plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0',vet_species[5]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm2,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.1',vet_species[5]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.25',vet_species[5]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.50',vet_species[5]))

Code
vet_species = c("Phaeocystis_antarctica","Phaeocystis_cordata","Phaeocystis_globosa","Phaeocystis_pouchetii","Phaeocystis_rex")

plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0',vet_species[5]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm2,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.1',vet_species[5]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.25',vet_species[5]))
plot_knowns_vs_unknowns(specific_species = T,
  dm_aux = dm3,df_ordered_ASVs = df_asv_species %>%
    filter(Species==vet_species[1])) + ggtitle(paste('alpha=0.50',vet_species[5]))

Code
inducedDist_ = inducedDist(
  dFrame = dframe$dframe,
  c1 = 1,c2=1000,c3=10,c4=10,
  compMatrix = dframe$ASV_composition)

alpha_ =  0.25
combdist=  alpha_*inducedDist_$normalizedMyDist + (1-alpha_)*inducedDist_$normalizedAitDist
pam_10 = cluster::pam(x = as.dist(combdist),k = 22)
hclust22= cutree(hclust(d = as.dist(combdist),method = 'ward.D'),k=22)

df_clusters = data.frame(
  ASV_name=dframe$ASV_composition$name,
  Cluster = pam_10$clustering)
  #Cluster = hclust22)


asvGroupping <- AnalyzeASVgroup(
  dfLonger = dframe$dframe %>% 
    left_join(df_clusters),
  clusterVar = 'Cluster',
  sampleVar = 'SampleID')
In this dataset we have: 1482 ASVs, and 987 samples 

Comparing the Species and The groupping that we got:

Code
asvGroupping$barplot

Code
asvGroupping$RelativeFreq
Species ASVG_01 ASVG_02 ASVG_03 ASVG_04 ASVG_05 ASVG_06 ASVG_07 ASVG_08 ASVG_09 ASVG_10 ASVG_11 ASVG_12 ASVG_13 ASVG_14 ASVG_15 ASVG_16 ASVG_17 ASVG_18 ASVG_19 ASVG_20 ASVG_21 ASVG_22
Phaeocystis_antarctica 0.0066611 0 0.0000000 0.1666667 0 0.0000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Phaeocystis_cordata 0.0033306 0 0.0000000 0.0000000 0 0.0000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Phaeocystis_pouchetii 0.0024979 0 0.0000000 0.0000000 0 0.0000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Phaeocystis_rex 0.0033306 0 0.0000000 0.0000000 0 0.0000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
NA 0.9841799 0 0.6438356 0.5000000 1 0.9137931 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Phaeocystis_globosa 0.0000000 1 0.3561644 0.3333333 0 0.0862069 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Code
asvGroupping$Freq
Species ASVG_01 ASVG_02 ASVG_03 ASVG_04 ASVG_05 ASVG_06 ASVG_07 ASVG_08 ASVG_09 ASVG_10 ASVG_11 ASVG_12 ASVG_13 ASVG_14 ASVG_15 ASVG_16 ASVG_17 ASVG_18 ASVG_19 ASVG_20 ASVG_21 ASVG_22
Phaeocystis_antarctica 8 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Phaeocystis_cordata 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Phaeocystis_pouchetii 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Phaeocystis_rex 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
NA 1182 0 47 3 1 53 63 52 1 1 4 2 1 1 1 3 1 1 4 2 3 2
Phaeocystis_globosa 0 1 26 2 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Code
asvGroupping$MostAbundantFreq %>% knitr::kable() %>% kableExtra::kable_styling()
Cluster Freq Pct
ASVG_01 210 0.2127660
ASVG_02 152 0.1540020
ASVG_03 102 0.1033435
ASVG_04 126 0.1276596
ASVG_05 2 0.0020263
ASVG_06 11 0.0111449
ASVG_07 39 0.0395137
ASVG_08 15 0.0151976
ASVG_09 148 0.1499493
ASVG_10 90 0.0911854
ASVG_11 3 0.0030395
ASVG_12 8 0.0081054
ASVG_13 4 0.0040527
ASVG_14 2 0.0020263
ASVG_15 1 0.0010132
ASVG_16 2 0.0020263
ASVG_17 3 0.0030395
ASVG_19 63 0.0638298
ASVG_21 1 0.0010132
ASVG_22 5 0.0050659
Code
asvGroupping$mapView

Here, the denominator is the sum of the raw.sequence.counts for each longhurst, and than observe the RA of each ASVG inside each longhurst

Code
asvGroupping$BarPlot_ASVG_LH_RA

Here, the denominator is the sum of the raw.sequence.counts for each ASVG, and than observe the RA of each longhurst inside each ASVG

Code
asvGroupping$BarPlot_ASVG_LH_RA2

Each sample here is represented by it’s most abundant ASV Group.

Code
asvGroupping$LatDepth

Code
asvGroupping$LatLong

Code
asvGroupping$DepthLong

If the ASV Groupping is not present, than the point is removed.

Code
asvGroupping$LatDepth_stacked

Code
asvGroupping$LatLong_stacked

Code
asvGroupping$DepthLong_stacked

Just to remember the percentage of the data that has missing on the abiotic factors

Code
asvGroupping$MissingAbiotic
        ID_Sample           Cluster                RA      RawSeqCounts 
        0.0000000         0.0000000         0.0000000         0.0000000 
TotalRawSeqCounts          Latitude         Longitude             Depth 
        0.0000000         0.0000000         0.0000000         0.0000000 
           Cruise          Salinity            Oxygen       Temperature 
        0.0000000         0.0000000         0.1023303         0.0000000 
         Silicate               NO2               NO3               NOx 
        0.1337386         0.2411348         0.1773050         0.1884498 
              NH3          LogDepth 
        0.9017224         0.0000000 

I’m just plotting the samples for which we have information.

Univariate

Code
asvGroupping$MatrixAbioticsUnivar

Bivariate

Code
asvGroupping$MatrixAbioticsBiVar

Code
asvGroupping$TSNE_ASVs_Ait_2d

Code
asvGroupping$TSNE_ASVs_Ait_3d
Code
asvGroupping$umap_ASVs_Ait_2d

Code
asvGroupping$umap_ASVs_Ait_3d
Code
asvGroupping$TSNE_Sample_Ait_2d

Code
asvGroupping$TSNE_Sample_Ait_3d
Code
asvGroupping$umap_Sample_Ait_2d

Code
asvGroupping$umap_Sample_Ait_3d